In the last tutorial (https://michaeldgarber.github.io/teach-r/1-dplyr-nyt-covid.html), we learned the basics of using dplyr to wrangle a state-level dataset on COVID-19 cases and deaths. In this tutorial, we will build on that work to calculate cumulative and daily incidence per population. To grab population data, we’ll load census data using tidycensus. We also introduce sf, which is the workhorse library for wrangling spatial data in R. We’ll join the census data with our COVID-19 dataset and do some interactive mapping with mapview and finish with some ggplot2 graphs. To that end, let’s make sure we have the following packages installed:

library(tidyverse)
library(tidycensus)
library(mapview)
library(sf)
library(viridis) #used to change color palettes
library(scales) #Used to help with the ggplot date axis below.

Re-generate the NYT COVID-19 dataset we were working with before.

This simply repeats the code that we used in the previous tutorial to give us the same dataset we were working with.

Load the state-level data from the NYTimes GitHub repository the same way we did before.

nyt_state_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
nyt_state_data = nyt_state_url %>% 
  url() %>% 
  read_csv()

And this code reproduces our previous data wrangling where we calculated the daily number of incident cases.

nyt_state_wrangle = nyt_state_data %>% 
  group_by(state) %>% 
  arrange(date) %>% 
  rename(
    cases_cumul = cases,
    deaths_cumul = deaths
  ) %>% 
  mutate(
    cases_cumul_day_before = lag(cases_cumul),
    cases_incident = cases_cumul - cases_cumul_day_before,
    
    deaths_cumul_day_before = lag(deaths_cumul),
    deaths_incident = deaths_cumul - deaths_cumul_day_before
  ) %>% 
  ungroup() %>% 
  arrange(state, date)

Load state-level census data using tidycensus

Tidycensus is a convenience package that allows census data to be downloaded into R. The online material on tidycensus has great examples: https://walker-data.com/tidycensus/index.html.

Use of tidycensus requires a key to access the U.S. Census API, which can be obtained for free here: http://api.census.gov/data/key_signup.html. The first time you use the key on your computer, enter this line of code.

census_api_key("your_census_key_here", install=TRUE)

Tidycensus lets the user specify

  • the survey (American Community Survey or U.S. Census),
  • the corresponding time period for that survey, the target geographic resolution and extent,
  • the census variables to be downloaded, and
  • whether or not the returned file will include the corresponding spatial data for the geographic units.

By specifying geometry=TRUE, an sf object will be returned representing the object for the chosen geography. If geometry=FALSE, then only the aspatial data will be returned. sf (or simple feature) objects have become the standard data structure for managing spatial data in R. We will cover sf objects more in a separate tutorial (link TBD). This chapter by Lovelace, Nowosad, and Muenchow is a great place to start to learn more about sf: https://geocompr.robinlovelace.net/spatial-class.html

One great aspect of sf objects–as compared with the sp package, the previous standard–is that spatial data can be handled as you would any other rectangular dataframe. An sf object is like a data.frame or tibble with an added column representing the geometry for that observation. The data structure of sf makes it straightforward to integrate spatial data into a typical data-wrangling workflow, during which you might add new variables, filter on a variable, or merge with additional data.

Load state-level population data

Without further adieu, let’s run the tidycensus function get_acs() to return state-level population data from ACS:

#Run this to save the data locally for speed.
options(tigris_use_cache = TRUE) 

#One way I keep track of whether an object has geometry associated with it to add a _geo suffix to the name.
pop_by_state_geo =get_acs(
  year=2019,
  variables = "B01001_001", #total population
  geography = "state",
  geometry = TRUE #to grab geometry
)
## Getting data from the 2015-2019 5-year ACS

A convenient way to search through the census variables is to use the load_variables function, as described here: https://walker-data.com/tidycensus/articles/basic-usage.html.

vars_acs_2019 <- load_variables(2019, "acs5", cache = TRUE)
View(vars_acs_2019)

Confirm the data are an sf object (simple feature) and explore the variable names.

class(pop_by_state_geo)
## [1] "sf"         "data.frame"
pop_by_state_geo
## Simple feature collection with 52 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1489 ymin: 17.88328 xmax: 179.7785 ymax: 71.36516
## Geodetic CRS:  NAD83
## First 10 features:
##    GEOID                 NAME   variable estimate moe
## 1     12              Florida B01001_001 20901636  NA
## 2     30              Montana B01001_001  1050649  NA
## 3     27            Minnesota B01001_001  5563378  NA
## 4     24             Maryland B01001_001  6018848  NA
## 5     45       South Carolina B01001_001  5020806  NA
## 6     23                Maine B01001_001  1335492  NA
## 7     15               Hawaii B01001_001  1422094  NA
## 8     11 District of Columbia B01001_001   692683  NA
## 9     44         Rhode Island B01001_001  1057231  NA
## 10    31             Nebraska B01001_001  1914571  NA
##                          geometry
## 1  MULTIPOLYGON (((-80.17628 2...
## 2  MULTIPOLYGON (((-116.0491 4...
## 3  MULTIPOLYGON (((-89.59206 4...
## 4  MULTIPOLYGON (((-76.05015 3...
## 5  MULTIPOLYGON (((-79.50795 3...
## 6  MULTIPOLYGON (((-67.32259 4...
## 7  MULTIPOLYGON (((-156.0608 1...
## 8  MULTIPOLYGON (((-77.11976 3...
## 9  MULTIPOLYGON (((-71.28802 4...
## 10 MULTIPOLYGON (((-104.0534 4...

A quick mapview

Another essential package in the sf workflow is mapview (https://r-spatial.github.io/mapview/index.html), which creates quick interactive maps. In just one line, we can interactively map the ACS state-level population data we just loaded.

mapview(pop_by_state_geo)

Wrangle the census data further to prep for the join with the NYT state-level COVID-19 data.

Let’s clean up the output from tidycensus to make it easier to link with the state-level COVID-19 data.

names(pop_by_state_geo)
## [1] "GEOID"    "NAME"     "variable" "estimate" "moe"      "geometry"

The variable corresponding to the state name is called ‘NAME’ in the tidycensus output. Rename it to ‘state’ so it can be more easily joined with the NYT data. Also note that there is a column with the variable name (‘variable’) and another column for its ‘estimate’. It’s formatted this way because the default output from tidycensus is long-form data. That is, we could have downloaded several variables for each state, and each variable-state combination would receive its own row. We can consolidate these fields by dropping the variable column and renaming ‘estimate’ to ‘population’. Finally, because GEOID may vary depending on the geographic unit, let’s explicitly remember it’s a two-digit FIPS code and call it ‘fips_2digit.’

names(nyt_state_wrangle)
## [1] "date"                    "state"                  
## [3] "fips"                    "cases_cumul"            
## [5] "deaths_cumul"            "cases_cumul_day_before" 
## [7] "cases_incident"          "deaths_cumul_day_before"
## [9] "deaths_incident"

We can wrangle this sf object the same way we would a tibble. For example, here we use a pipe (%>%) and the dplyr verbs rename(), select(), and mutate().

pop_by_state_wrangle_geo = pop_by_state_geo %>%
  rename(
    fips_2digit = GEOID,
    state = NAME,
    population = estimate
  ) %>% 
  dplyr::select(-moe, -variable) %>% 
  #To simplify some of the maps below, create an indicator variable corresponding to the Continental 48.
  #Do the same type of operation using two different types of syntax, the  %in% operator and the or operator.
  mutate(
    continental_48 = case_when(
      state %in% c("Alaska", "Hawaii", "Northern Mariana Islands", "Guam", "Virgin Islands", "Puerto Rico") ~ 0,
      TRUE ~ 1)    
    ) 

names(pop_by_state_wrangle_geo)
## [1] "fips_2digit"    "state"          "population"     "geometry"      
## [5] "continental_48"

Use mapview to visualize population.

To confirm the population data are as we expect, let’s map the state’s populations using mapview. The pipe operator, %>%, can also be used before a mapview() call. In this code, the sf object, pop_by_state_geo, and all subsequent changes we make it to it, e.g., after having used filter() here, are “piped” into the mapview() function. The zcol argument specifies which variable to be visualized. As you may notice, mapview defaults to the viridis palette.

pop_by_state_wrangle_geo %>%
  filter(continental_48==1) %>%   #Limit to continental 48 so the default zoom is more zoomed in.
  mapview(zcol = "population" ) #Note the object argument is implied by the pipe operator.